Learning-Based Dissimilarity for Clustering Categorical Data

Abstract

Comparing data objects is at the heart of machine learning. For continuous data, object dissimilarity is usually taken to be object distance; however, for categorical data there is no universal agreement, for categories can be ordered in several different ways. Most existing category dissimilarity measures characterize the distance among the values an attribute may take using precisely the number of different values the attribute takes (the attribute space) and the frequency at which they occur. These kinds of measures overlook attribute interdependence, which may provide valuable information when capturing per-attribute object dissimilarity. In this paper, we introduce a novel object dissimilarity measure that we call Learning-Based Dissimilarity, for comparing categorical data. Our measure characterizes the distance between two categorical values of a given attribute in terms of how likely it is that such values are confused or not when all the dataset objects with the remaining attributes are used to predict them. To that end, we provide an algorithm that, given a target attribute, first learns a classification model in order to compute a confusion matrix for the attribute. Then, our method transforms the confusion matrix into a per-attribute dissimilarity measure. We have successfully tested our measure against 55 datasets gathered from the University of California, Irvine (UCI) Machine Learning Repository. Our results show that it surpasses, in terms of various performance indicators for data clustering, the most prominent distance relations put forward in the literature.
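
The two steps named in the abstract (learn a classifier for a target attribute, then turn its confusion matrix into a dissimilarity) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the classifier is a tiny categorical naive Bayes model evaluated on its own training data, the function names are hypothetical, and the confusion-to-dissimilarity formula (one minus the average of the two row-normalised confusion rates) is only one plausible choice.

```python
from collections import Counter, defaultdict
from math import log


def confusion_matrix_for_attribute(rows, target):
    """Predict column `target` from the remaining columns and count, for each
    true value, how often each value is predicted.

    Assumption: a small categorical naive Bayes model with Laplace smoothing,
    scored on the training rows themselves, stands in for the classifier."""
    others = [j for j in range(len(rows[0])) if j != target]
    labels = sorted({r[target] for r in rows})
    prior = Counter(r[target] for r in rows)
    domain = {j: {r[j] for r in rows} for j in others}
    # cond[j][label][value]: rows with this target label and this value in column j
    cond = {j: defaultdict(Counter) for j in others}
    for r in rows:
        for j in others:
            cond[j][r[target]][r[j]] += 1

    def predict(r):
        def score(label):
            s = log(prior[label] / len(rows))
            for j in others:
                s += log((cond[j][label][r[j]] + 1) / (prior[label] + len(domain[j])))
            return s
        return max(labels, key=score)

    cm = {a: Counter() for a in labels}
    for r in rows:
        cm[r[target]][predict(r)] += 1
    return labels, cm


def dissimilarity_from_confusion(labels, cm):
    """Assumed transformation: values that are often confused with each other
    get a small dissimilarity; values never confused get dissimilarity 1."""
    d = {}
    for a in labels:
        for b in labels:
            if a == b:
                d[a, b] = 0.0
            else:
                p_ab = cm[a][b] / sum(cm[a].values())  # rate at which a is predicted as b
                p_ba = cm[b][a] / sum(cm[b].values())  # rate at which b is predicted as a
                d[a, b] = 1.0 - (p_ab + p_ba) / 2
    return d


# Toy data (hypothetical): column 0 is a colour, column 1 is a shape.
# "red" and "crimson" share the same shape, so the classifier cannot separate them.
rows = (
    [("red", "round")] * 2
    + [("crimson", "round")] * 2
    + [("blue", "square")] * 4
)
labels, cm = confusion_matrix_for_attribute(rows, target=0)
d = dissimilarity_from_confusion(labels, cm)
```

On this toy data, "red" and "crimson" end up closer to each other than either is to "blue", because only the first two are confused when the shape column is used to predict the colour. A per-attribute table like `d` could then be summed over attributes to compare whole objects; that aggregation is likewise an assumption, not a detail stated in the abstract.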

Related articles

Context-Based Distance Learning for Categorical Data Clustering

Clustering data described by categorical attributes is a challenging task in data mining applications. Unlike numerical attributes, it is difficult to define a distance between pairs of values of the same categorical attribute, since they are not ordered. In this paper, we propose a method to learn a context-based distance for categorical attributes. The key intuition of this work is that the d...

An association-based dissimilarity measure for categorical data

In this paper, we propose a novel method to measure the dissimilarity of categorical data. The key idea is to consider the dissimilarity between two categorical values of an attribute as a combination of dissimilarities between the conditional probability distributions of other attributes given these two values. Experiments with real data show that our dissimilarity estimation method improves t...

Clustering-Based Categorical Data Protection

The need of improving the privacy on public datasets is becoming more and more important because the number of public available datasets is growing very fast. This forced the continuous research to find better protection methods that prevent the disclosure of the entities or individuals in a dataset while preserving the data utility. In this paper we present a new approach for categorical data ...

Distance based Clustering for Categorical Data

Learning distances from categorical attributes is a very useful data mining task that allows to perform distance-based techniques, such as clustering and classification by similarity. In this article we propose a new context-based similarity measure that learns distances between the values of a categorical attribute (DILCA DIstance Learning of Categorical Attributes). We couple our similarity m...

Dissimilarity-based learning for complex data

Rapid advances of information technology have entailed an ever increasing amount of digital data, which raises the demand for powerful data mining and machine learning tools. Due to modern methods for gathering, preprocessing, and storing information, the collected data become more and more complex: a simple vectorial representation, and comparison in terms of the Euclidean distance is often no...

Journal

Journal title: Applied Sciences

Year: 2021

ISSN: 2076-3417

DOI: https://doi.org/10.3390/app11083509